### UCSD BASED TECHNIQUE USED VLSI ARCHITECTURE FOR DISCRETE WAVELET TRANSFORMATION

**Rakesh Kumar Sadangi**<sup>1</sup>

Department of Electronics and Communication, Aryan Institute of Engineering and Technology, Bhubaneswar

**Abhishek Das**<sup>2</sup>

Department of Electronics and Communication, Raajdhani Engineering College, Bhubaneswar

**Manoranjan Sahoo**<sup>3</sup>

Department of Electronics and Communication, Capital Engineering College (CEC), Bhubaneswar

**Swaha Pattnaik**<sup>4</sup>

Department of Electronics and Communication, NM Institute of Engineering & Technology, Bhubaneswar

#### ABSTRACT

Conventional distributed arithmetic (DA) is popular in field programmable gate array (FPGA) design, where it relies on on-chip ROM to achieve high speed and regularity. In this paper, we describe a high-speed, area-efficient 2-D discrete wavelet transform (DWT) using a 9/7 filter based on the canonic signed digit (CSD) technique. The CSD architecture is area efficient and free of ROM, multiplication, and subtraction, and it also exposes the redundancy in the adder array, whose entries are 0 and 1. This architecture supports any size of image pixel value and any level of decomposition. The parallel structure has 100% hardware utilization efficiency.

**Key words:** 2-D Discrete Wavelet Transform (DWT), CSD, Low Pass Filter, High Pass Filter, Xilinx Simulation.

**Cite this Article:** Priya Sahu and Dr. Paresh Rawat, VLSI Architecture For Discrete Wavelet Transform Using CSD Based Technique, *International Journal of Electronics and Communication Engineering and Technology*, 7(6), 2016, pp. 48–55. http://www.iaeme.com/IJECET/issues.asp?JType=IJECET&VType=7&IType=6

# **1. INTRODUCTION**

The discrete wavelet transform (DWT) is a mathematical technique for signal processing that decomposes a discrete signal in the time domain using dilated/contracted and translated versions of a single basis function, called the prototype wavelet [Mallat (1989a); Mallat (1989b); Daubechies (1992); Meyer (1993); Vetterli and Kovacevic (1995)]. The DWT offers a wide variety of useful features over other unitary transforms such as the discrete Fourier transform (DFT), discrete cosine transform (DCT) and discrete sine transform (DST). These features include adaptive time-frequency windows, lower aliasing distortion for signal processing applications, efficient computational complexity and inherent scalability [Grzeszczak et al. (1996)]. Owing to these features, the one-dimensional (1-D) DWT and two-dimensional (2-D) DWT are applied in various fields such as numerical analysis [Beylkin et al. (1992)], signal analysis [Akansu and Haddad (1992)], image coding [Sodagar et al. (1994)] and biomedicine [Senhadji et al. (1994)]. Several algorithms and computation schemes have been suggested during the last three decades for efficient hardware implementation of the 1-D DWT and 2-D DWT.

The DWT is computationally intensive, and most of its applications demand real-time processing. One way of achieving high-speed performance is to use a fast computational algorithm on a general-purpose computer. Another way is to exploit the parallelism inherent in the computation for concurrent processing by a set of parallel processors. However, it is not cost effective to use a general-purpose computer for a specific application; such an implementation also requires more space, more power and more computation time. Very large scale integration (VLSI) technology allows digital signal processing (DSP) system designers to build a high-performance, low-cost and low-power system on a single chip. VLSI systems offer great potential for concurrency and an enormous amount of computing power within a small area [Weste and Eshraghian (1993)]. Computation is cheap because hardware is not an obstacle in a VLSI system, but non-localized global communication is expensive and demands high power dissipation. Thus, a high degree of parallelism and nearest-neighbor communication are crucial for realizing a high-performance VLSI system [Kung (1982)]. With this in view, high-performance application-specific VLSI systems have evolved rapidly in recent years. Special-purpose VLSI systems maximize processing concurrency through parallel/pipeline processing and provide a cost-effective alternative for real-time applications. Therefore, the 2-D DWT is currently implemented in VLSI systems to meet the temporal requirements of real-time applications, and several design schemes have been suggested in the last two decades for its efficient implementation. Researchers have adopted different algorithm formulations, mapping schemes and architectural design methods to reduce the computational time, arithmetic complexity or memory complexity of 2-D DWT structures. However, the area-delay performance of the existing structures changes only marginally. This is mainly due to the memory complexity, which forms a major hardware component of folded 2-D DWT structures.

A detailed study of the existing design methods and a complexity analysis is made in Chapter 2 to find an appropriate design strategy to improve the area-delay performance of 2-D DWT structures.

## 2. MULTILEVEL DISCRETE WAVELET TRANSFORM

Multiresolution analysis (MRA) is a characteristic feature of SB and is used for better spectral representation of the signal. In MRA, the signal is decomposed over more than one DWT level, known as multilevel DWT: the low-pass output of the first DWT level is decomposed again in the same manner to obtain the second level, and the process is repeated for higher levels. A few algorithms have been suggested for computing the multilevel DWT. One of the most important is the pyramid algorithm (PA), proposed by Mallat (1989a) for parallel computation of the multilevel DWT. The PA for the 1-D DWT is given by

$$Y_{l}^{j}(n) = \sum_{i=0}^{k-1} h(i) Y_{l}^{j-1}(2n-i)$$
(1)

$$Y_{h}^{j}(n) = \sum_{i=0}^{k-1} g(i) Y_{l}^{j-1} (2n-i)$$
(2)

where $Y_l^j(n)$ is the n-th low-pass sub-band component of the j-th DWT level, $Y_h^j(n)$ is the n-th high-pass sub-band component of the j-th DWT level, and $h(i)$ and $g(i)$ are the low-pass and high-pass filter coefficients. Two-dimensional signals, such as images, are analyzed using the 2-D DWT. Currently the 2-D DWT is applied in many image processing applications such as image compression and reconstruction [Lewis and Knowles (1992)], pattern recognition [Kronland *et al.* (1987)], biomedicine [Senhadji *et al.* (1994)] and computer graphics [Meyer (1993)]. The 2-D DWT is a mathematical technique that decomposes an input image in the multiresolution frequency space. At each level it decomposes the image into four sub-bands, known as the low-low (LL), low-high (LH), high-low (HL) and high-high (HH) sub-bands.
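The pyramid recursion can be sketched in a few lines of Python. This is a minimal illustration rather than the architecture of this paper: it assumes that both filters at level j read the low-pass output of level j-1 (as the text describes), that samples outside the signal are zero, and it uses a hypothetical two-tap filter pair purely to keep the arithmetic easy to follow. The function and variable names are illustrative.

```python
def tap(signal, index):
    """Return signal[index], or 0.0 outside the signal (zero padding, an assumption)."""
    return signal[index] if 0 <= index < len(signal) else 0.0


def pyramid_dwt(x, h, g, levels):
    """Multilevel 1-D DWT: filter with h and g, keep every second output sample."""
    approx = list(x)      # level-0 low-pass signal is the input itself
    details = []          # high-pass outputs, one list per level
    for _ in range(levels):
        n_out = len(approx) // 2
        low = [sum(h[i] * tap(approx, 2 * n - i) for i in range(len(h)))
               for n in range(n_out)]
        high = [sum(g[i] * tap(approx, 2 * n - i) for i in range(len(g)))
                for n in range(n_out)]
        details.append(high)
        approx = low      # only the low-pass branch is decomposed further
    return approx, details


# Hypothetical averaging/differencing pair, not the 9/7 filters.
approx, details = pyramid_dwt([1, 2, 3, 4, 5, 6, 7, 8], [0.5, 0.5], [0.5, -0.5], levels=2)
print(approx)   # final low-pass band
print(details)  # high-pass band of each level
```

Each level halves the number of samples, so the work per level shrinks geometrically.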

| LL<sub>3</sub> | HL<sub>3</sub> | HL<sub>2</sub> | HL<sub>1</sub> |
|----------------|----------------|----------------|----------------|
| LH<sub>3</sub> | HH<sub>3</sub> |                |                |
| LH<sub>2</sub> |                | HH<sub>2</sub> |                |
| LH<sub>1</sub> |                |                | HH<sub>1</sub> |

Figure 1 Three Level Diagram of 2-D Sub-band Wavelet Transform
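The sub-band layout of Figure 1 can be reproduced with a short separable decomposition. The sketch below is illustrative only: it substitutes an unnormalised two-tap average/difference pair for the 9/7 filters, filters rows before columns, and assumes the naming convention that LH means low-pass along rows followed by high-pass along columns. All function names are hypothetical.

```python
def analyze_1d(x):
    """One analysis step with a stand-in average/difference pair (not the 9/7 filter)."""
    half = len(x) // 2
    low = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(half)]
    high = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(half)]
    return low, high


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def column_pass(block):
    """Filter every column of block; return (low, high) blocks of half the height."""
    lows, highs = [], []
    for col in transpose(block):
        lo, hi = analyze_1d(col)
        lows.append(lo)
        highs.append(hi)
    return transpose(lows), transpose(highs)


def dwt2_level(image):
    """One 2-D level: row filtering followed by column filtering."""
    row_low, row_high = [], []
    for row in image:
        lo, hi = analyze_1d(row)
        row_low.append(lo)
        row_high.append(hi)
    ll, lh = column_pass(row_low)    # low along rows
    hl, hh = column_pass(row_high)   # high along rows
    return ll, lh, hl, hh


def dwt2_multilevel(image, levels):
    """Decompose the LL band repeatedly and collect the sub-bands by name."""
    bands = {}
    ll = image
    for j in range(1, levels + 1):
        ll, lh, hl, hh = dwt2_level(ll)
        bands[f"LH{j}"], bands[f"HL{j}"], bands[f"HH{j}"] = lh, hl, hh
    bands[f"LL{levels}"] = ll
    return bands


bands = dwt2_multilevel([[4.0] * 8 for _ in range(8)], levels=3)
for name in sorted(bands):
    print(name, len(bands[name]), "x", len(bands[name][0]))
```

For an 8 x 8 input and three levels this yields ten sub-bands, with the level-1 bands 4 x 4, the level-2 bands 2 x 2 and the level-3 bands 1 x 1.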

### **3. PROPOSED ARCHITECTURE**

The block diagram of the 9/7 wavelet coefficient based multilevel discrete wavelet transform using the CSD structure is shown in Figure 2. In this figure, each input sample passes through an 8-bit register, after which the symmetrically delayed inputs are added as in equations (3) to (6).

$$r_1 = X(n) + X(n-6)$$
(3)

$$r_2 = X(n-1) + X(n-5)$$
(4)

$$r_3 = X(n-2) + X(n-4)$$
(5)

$$r_4 = X(n-3)$$
(6)
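The benefit of adding the symmetric delayed inputs first can be checked numerically. The sketch below assumes a generic symmetric 7-tap filter with hypothetical integer taps (not the 9/7 coefficients) and compares the direct 7-product sum with the folded form, which needs only four products.

```python
def fold_symmetric_7(x, n):
    """Pre-add the symmetric delayed samples as in equations (3) to (6); requires n >= 6."""
    r1 = x[n] + x[n - 6]
    r2 = x[n - 1] + x[n - 5]
    r3 = x[n - 2] + x[n - 4]
    r4 = x[n - 3]
    return r1, r2, r3, r4


def fir7_direct(x, n, taps):
    """Reference: direct convolution sum with all seven taps."""
    return sum(taps[k] * x[n - k] for k in range(7))


def fir7_folded(x, n, taps):
    """Same output using the four pre-added terms and the first four taps only."""
    r1, r2, r3, r4 = fold_symmetric_7(x, n)
    return taps[0] * r1 + taps[1] * r2 + taps[2] * r3 + taps[3] * r4


taps = [1, 2, 3, 4, 3, 2, 1]          # hypothetical symmetric taps
samples = [3, 1, 4, 1, 5, 9, 2, 6]    # arbitrary test input
print(fir7_direct(samples, 6, taps), fir7_folded(samples, 6, taps))
```

Both forms give the same value because taps k and 6-k are equal, so each pair of samples can share one coefficient.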

We use CSD in the 9/7 filter to remove multipliers. CSD has to be applied twice to obtain the 1-D 9/7 filter high-pass output $Y_{H1}$ and low-pass output $Y_{L1}$.



Where $h_0, h_1, h_2, h_3, h_4$ are the low-pass filter coefficients and $g_0, g_1, g_2, g_3$ are the high-pass filter coefficients.
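For reference, canonic signed digit recoding in its textbook form represents an integer with digits from {-1, 0, +1} such that no two adjacent digits are non-zero, which keeps the number of shift-and-add terms small. The sketch below is a generic recoder, not the circuit of this paper; the function names are illustrative, and the integers used are the scaled low-pass coefficients 77, 34, -10, -2 and 3 quoted in the worked example of this section.

```python
def to_csd(value):
    """Return the CSD digits of an integer, least significant digit first."""
    digits = []
    while value != 0:
        if value % 2 == 0:
            digit = 0
        else:
            # +1 when value is 1 mod 4, -1 when value is 3 mod 4
            digit = 2 - (value % 4)
        digits.append(digit)
        value = (value - digit) // 2
    return digits


def csd_multiply(x, digits):
    """Multiply x by the recoded constant using only shifts and signed additions."""
    return sum(d * (x << i) for i, d in enumerate(digits))


coefficients = [77, 34, -10, -2, 3]
inputs = [1, 2, 3, 4, 5]
for c in coefficients:
    print(c, to_csd(c))
print(sum(csd_multiply(m, to_csd(c)) for c, m in zip(coefficients, inputs)))
```

A digit of -1 contributes a subtracted, shifted copy of the input; how that is realised in hardware is a design choice outside this sketch.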

If the CSD technique is applied to the high-pass coefficients $g_0$, $g_1$, $g_2$ and $g_3$ with inputs $r_1$, $r_2$, $r_3$ and $r_4$, we obtain the high-pass output $Y_{H}$ of the 9/7 filter; if it is applied to the low-pass coefficients $h_0$, $h_1$, $h_2$, $h_3$ and $h_4$ with inputs $m_1$, $m_2$, $m_3$, $m_4$ and $m_5$, we obtain the low-pass output $Y_{L}$ of the 9/7 filter. As an example, the low-pass output is worked through step by step below:

$$Y_L = \begin{bmatrix} h_0 & h_1 & h_2 & h_3 & h_4 \end{bmatrix} \begin{bmatrix} m_1 \\ m_2 \\ m_3 \\ m_4 \\ m_5 \end{bmatrix}$$

Let $m_1 = 1$, $m_2 = 2$, $m_3 = 3$, $m_4 = 4$ and $m_5 = 5$. The Daubechies 9/7 low-pass filter coefficients $h_0$, $h_1$, $h_2$, $h_3$ and $h_4$ are 0.6029490, 0.2668444, -0.0782232, -0.0168641 and 0.02674875 respectively. Multiplying each coefficient by 128 and rounding gives the integers 77, 34, -10, -2 and 3 respectively. Multiplying this row of integers by the column of inputs then gives the low-pass output 122.

$$Y_{L} = \begin{bmatrix} 77 & 34 & -10 & -2 & 3 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \end{bmatrix} = 122$$
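The scaling and the inner product of this example can be checked with a few lines of Python. The coefficient values are those quoted in the text, with the third taken as -0.0782232, the value consistent with the scaled integer -10; rounding to the nearest integer is an assumption about how the integers were obtained.

```python
# Low-pass coefficients as quoted in the text (third value assumed to be -0.0782232).
H = [0.6029490, 0.2668444, -0.0782232, -0.0168641, 0.02674875]
SCALE = 128                      # 2**7, so the scaling is a 7-bit shift

h_int = [round(c * SCALE) for c in H]
m = [1, 2, 3, 4, 5]              # example inputs m_1 .. m_5

y_l = sum(c * x for c, x in zip(h_int, m))
print(h_int)   # scaled integer coefficients
print(y_l)     # low-pass output of the worked example
```

Because the scale factor is a power of two, the result can be rescaled with a shift rather than a division.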

Applying the CSD technique to the low-pass coefficients $h_0$, $h_1$, $h_2$, $h_3$ and $h_4$ with inputs $m_1$, $m_2$, $m_3$, $m_4$ and $m_5$ gives the low-pass output $Y_{L}$ of the 9/7 filter.

Now the DA matrix can be formed from the filter coefficients. For the low-pass filter, each row of the DA matrix holds one bit of the 8-bit two's-complement coefficients 77, 34, -10, -2 and 3, least significant bit first, so that the product of the DA matrix and the input vector gives the adder array sums:

$$\begin{bmatrix} 1 & 0 & 0 & 0 & 1 \\ 0 & 1 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 1 & 1 & 0 \\ 1 & 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 1 & 0 \end{bmatrix} \begin{bmatrix} m_1 \\ m_2 \\ m_3 \\ m_4 \\ m_5 \end{bmatrix} = \begin{bmatrix} m_1 + m_5 \\ m_2 + m_3 + m_4 + m_5 \\ m_1 + m_3 + m_4 \\ m_1 + m_4 \\ m_3 + m_4 \\ m_2 + m_3 + m_4 \\ m_1 + m_3 + m_4 \\ m_3 + m_4 \end{bmatrix}$$

The last row corresponds to the sign bit and therefore enters the final sum with negative weight.
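A bit-plane matrix of this kind can be generated directly from the integer coefficients. The sketch below assumes 8-bit two's-complement coefficients with the least significant bit in the first row, and uses the example inputs 1 to 5 from the text; the function names are illustrative.

```python
def bit_planes(coeffs, width=8):
    """Row b holds bit b (LSB first) of every coefficient in two's complement."""
    return [[(c >> b) & 1 for c in coeffs] for b in range(width)]


def adder_array(planes, inputs):
    """Each row selects the inputs whose coefficient has a 1 in that bit position."""
    return [sum(x for bit, x in zip(row, inputs) if bit) for row in planes]


coeffs = [77, 34, -10, -2, 3]   # scaled low-pass coefficients from the text
inputs = [1, 2, 3, 4, 5]        # example inputs m_1 .. m_5

planes = bit_planes(coeffs)
row_sums = adder_array(planes, inputs)
for row, total in zip(planes, row_sums):
    print(row, total)
```

Rows that share the same pattern, or a common sub-pattern such as the inputs $m_3$ and $m_4$, produce the same partial sum, which is the redundancy the adder array can exploit.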



Figure 2 Proposed Architecture 1-D for Low Pass Filter Using CSD Technique

In Figure 2, the CSD technique is applied in four steps.

Step-1: all inputs are converted to binary numbers, so $m_1 = 001$, $m_2 = 010$, $m_3 = 011$, $m_4 = 100$, $m_5 = 101$.

Step-2: sign extension is applied to all binary inputs, so s(1) = 0001, s(2) = 0010, s(3) = 0011, s(4) = 0100, s(5) = 0101.

Step-3: the sign-extended inputs are applied to the adder array, so m(1) = 0110, m(2) = 1110, m(3) = 1000, m(4) = 0101, m(5) = 0111, m(6) = 1001, m(7) = 1000 and $m(8) = \mathrm{not}(m_3 + m_4) + 1 = 1001$.

Step-4: the adder array outputs are applied to the MUX and accumulated by shifting and adding.

The first adder array output m(1) is taken as the initial partial sum:

$$Y_{P}(0) = \mathrm{MUX}(1) = 0110$$

$Y_P(0)$ is right-shifted by 1 bit relative to MUX(2) = 1110, so that MUX(2) carries twice the weight, and the two are added:

$$Y_{P}(1) = 0110 + 11100 = 100010$$

$Y_P(1)$ is again right-shifted by 1 bit relative to MUX(3) = 1000 and the two are added:

$$Y_{P}(2) = 100010 + 100000 = 1000010$$

The process continues one term at a time until the final output is obtained:

$$Y_{P}(7) = 00001111010 = 122$$

The carry is discarded.
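The shift-and-add accumulation of Step 4 can be traced in software. The sketch below takes the eight adder array sums from Step 3 as plain integers (6, 14, 8, 5, 7, 9, 8 and, for the sign-bit row, 7) and assumes that the sign-bit term is subtracted, which is what adding its two's complement and discarding the carry achieves in hardware. The function name is illustrative.

```python
def shift_add_accumulate(row_sums):
    """Accumulate bit-plane sums LSB first; the last (sign-bit) row is subtracted."""
    partials = []
    acc = 0
    last = len(row_sums) - 1
    for bit, value in enumerate(row_sums):
        term = value << bit              # weight 2**bit for this bit plane
        acc += -term if bit == last else term
        partials.append(acc)
    return partials


partials = shift_add_accumulate([6, 14, 8, 5, 7, 9, 8, 7])
for step, value in enumerate(partials):
    print(f"Y_P({step}) = {value}")
```

The first three partial sums are 6, 34 and 66 (binary 0110, 100010 and 1000010), and the last is 122, matching the inner product computed earlier.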

For the 2-D sub-band WT, the outputs of the 1-D high-pass and low-pass filters, $Y_{H1}$ and $Y_{L1}$, are passed through a series of shift registers, and the samples are then taken in parallel using the parallel data access method, which minimizes the memory requirement of the 2-D sub-band WT.


## 4. SIMULATION RESULT

All design and experiments for the algorithm described in this paper were developed on the Xilinx 6.2i updated version. Xilinx 6.2i has several notable features, such as low memory requirement, fast debugging and low cost. The latest release of the ISE<sup>TM</sup> (Integrated Software Environment) design tool lowers the memory requirement by approximately 27 percent. ISE 6.2i provides advanced tools such as smart compile technology, which makes better use of the computing hardware and gives faster timing closure and higher quality of results, shortening the time to a design solution. With this software the program can be debugged easily. Also included is the newest release of the ChipScope Pro Serial IO Toolkit, which simplifies debugging of high-speed serial IO designs for Virtex-2 FX and Virtex-E LXT and SXT FPGAs. With the help of this tool we can develop designs in communication as well as in signal processing and low-power VLSI design.

We functionally verified the 2-D sub-band WT presented in this paper, including all low-pass and high-pass filters. The results in Table 1 give the number of slice registers, the number of slice LUTs and the maximum combinational path delay for the selected device. The RTL (register transfer level) view of the 2-D sub-band tree structure is shown in Figure 3.

#### Table 1 First Level DWT

| Selected Device: 7vh290thcg1155-2    |                      |
|--------------------------------------|----------------------|
| Slice Logic Utilization:             |                      |
| Number of Slice Registers:           | 37 out of 437600 0%  |
| Number of Slice LUTs:                | 206 out of 218800 0% |
| Number used as Logic:                | 205 out of 218800 0% |
| Slice Logic Distribution:            |                      |
| Number of LUT Flip Flop pairs used:  | 233                  |
| Number with an unused Flip Flop:     | 196 out of 233 84%   |
| Number with an unused LUT:           | 27 out of 233 11%    |
| Number of fully used LUT-FF pairs:   | 10 out of 233 4%     |
| Number of unique control sets:       | 2                    |
| IO Utilization:                      |                      |
| Number of IOs:                       | 29                   |
| Number of bonded IOBs:               | 29 out of 300 9%     |

Minimum period: 1.155ns (Maximum Frequency: 865.801MHz)

Minimum input arrival time before clock: 0.472ns

Maximum output required time after clock: 10.782ns

Maximum combinational path delay: 9.815ns

#### Table 2 Second Level DWT

| Selected Device: 7vh290thcg1155-2    |                      |
|--------------------------------------|----------------------|
| Slice Logic Utilization:             |                      |
| Number of Slice Registers:           | 233 out of 437600 0% |
| Number of Slice LUTs:                | 975 out of 218800 0% |
| Number used as Logic:                | 975 out of 218800 0% |
| Slice Logic Distribution:            |                      |
| Number of LUT Flip Flop pairs used:  | 1069                 |
| Number with an unused Flip Flop:     | 836 out of 1069 78%  |
| Number with an unused LUT:           | 94 out of 1069 8%    |
| Number of fully used LUT-FF pairs:   |                      |
| Number of unique control sets:       |                      |
| IO Utilization:                      |                      |
| Number of IOs:                       | 44                   |
| Number of bonded IOBs:               | 44 out of 300 14%    |

Minimum period: 6.284ns (Maximum Frequency: 159.132MHz)

Minimum input arrival time before clock: 5.945ns

Maximum output required time after clock: 17.713ns

Maximum combinational path delay: 17.411ns



Figure 3 RTL View of Second Level Wavelet Transform

Table 3 Comparison result of existing algorithm and proposed algorithm

| Design                         | Number of Slice LUTs | Slice Registers | Minimum Period | Total Memory Usage |
|--------------------------------|----------------------|-----------------|----------------|--------------------|
| Linning Vo                     | -                    | 159             | 1.78 ns        | 447                |
| Linning Ye                     | -                    | 384             | 12.9 ns        | 204.7              |
| Proposed Design (first level)  | 206                  | 37              | 1.155 ns       | 316                |
| Proposed Design (second level) | 975                  | 233             | 6.284 ns       | 240                |



## 5. CONCLUSION

The sub-band wavelet transform standardizes two basic blocks for image compression, namely the low-pass filter and the high-pass filter. The wavelet transform has wide application in areas such as image compression, signal processing and VLSI design. We propose a novel distributed arithmetic paradigm for the 2-D sub-band transform, named the CSD structure, for VLSI implementation of digital signal processing (DSP) algorithms involving inner products of vectors and vector-matrix multiplication. We demonstrate that CSD is a very efficient architecture with adders as the main component, free of ROM (memory free), multiplication and subtraction. For the adder array, a systematic approach is introduced to remove potential redundancy so that a minimum number of additions is necessary.

#### References

- [1] S. G. Mallat, A Theory for Multiresolution Signal Decomposition: The Wavelet Representation, IEEE Trans. on Pattern Analysis and Machine Intelligence, 11(7), July 1989, pp. 674–693.
- [2] M. Alam, C. A. Rahman, and G. Jullian, Efficient distributed arithmetic based DWT architectures for multimedia applications, in Proc. IEEE Workshop on SoC for real-time applications, pp. 333–336, 2003.
- [3] X. Cao, Q. Xie, C. Peng, Q. Wang and D. Yu, An efficient VLSI implementation of distributed architecture for DWT, in Proc. IEEE Workshop on Multimedia and Signal Process., pp. 364–367, 2006.
- [4] Archana Chidanandan and Magdy Bayoumi, Area-Efficient CSD Architecture for the 1-D DCT/IDCT, ICASSP 2006.
- [5] M. Martina and G. Masera, Low-complexity, efficient 9/7 wavelet filters VLSI implementation, IEEE Trans. on Circuits and Syst. II, Express Brief 53(11), pp. 1289–1293, Nov. 2006.
- [6] M. Martina and G. Masera, Multiplierless, folded 9/7-5/3 wavelet VLSI architecture, IEEE Trans. on Circuits and Syst. II, Express Brief 54(9), pp. 770–774, Sep. 2007.
- [7] Gaurav Tewari, Santu Sardar, K. A. Babu, High-Speed & Memory Efficient 2-D DWT on Xilinx Spartan3A DSP using scalable Polyphase Structure with DA for JPEG2000 Standard, 978-1-4244-8679-3/11/\$26.00 ©2011 IEEE.
- [8] B. K. Mohanty and P. K. Meher, Memory Efficient Modular VLSI Architecture for High-throughput and Low-Latency Implementation of Multilevel Lifting 2-D DWT, IEEE Transactions on Signal Processing, 59(5), 2011.
- [9] Nitin S. Sonar and Dr. R.R. Mudholkar, Implementation of Data Error Corrector Using VLSI Technique, *International Journal of Electronics and Communication Engineering and Technology*, 5(5), 2014, pp. 91–95.
- [10] B. K. Mohanty and P. K. Meher, Memory-Efficient High-Speed Convolution-based Generic Structure for Multilevel 2-D DWT, IEEE Transactions on Circuits and Systems for Video Technology.
- [11] B. K. Mohanty and P. K. Meher, Efficient Multiplierless Designs for 1-D DWT using 9/7 Filters Based on Distributed Arithmetic, ISIC 2009.